What are the Most Valued Data Science Skills?

By Meaghan, Albert, Hovig, Justin, Rose and Brian

March 25, 2018

Introduction

Kaggle ML and Data Science Survey, 2017

Five months ago, Kaggle, a website that offers competitions to teams of data scientists for cash prizes, released its annual user survey. This comprehensive survey posed numerous questions to Kaggle members in order to collect metrics on its user base. Our group selected this survey to serve as our data set for determining the top hard and soft skills required of a data scientist.

Breakdown:

  • Nearly 3,000 observations, subset from a larger raw set of nearly 16,000 observations to isolate working data scientists.
  • More than 200 questions on a variety of topics.

Credit to Amber Thomas for providing the following code, used to extract and summarize answers to multiple-choice questions.

# Summarize a single-choice question: counts and percentages for each
# answer, sorted by frequency. Relies on the global data frame `exp_df`.
chooseOne = function(question){
    exp_df %>%
        filter(UQ(sym(question)) != "") %>%
        dplyr::group_by_(question) %>%
        dplyr::summarise(count = n()) %>%
        dplyr::mutate(percent = (count / sum(count)) * 100) %>%
        dplyr::arrange(desc(count))
}

# Summarize a multiple-choice question: split the comma-separated
# selections (the perl regex leaves commas inside parentheses intact),
# then count and compute the percentage of respondents choosing each option.
chooseMultiple = function(question, df){
  df %>%
    dplyr::filter(UQ(sym(question)) != "") %>%
    dplyr::select(question) %>%
    dplyr::mutate(totalCount = n()) %>%
    dplyr::mutate(selections = strsplit(as.character(UQ(sym(question))),
                                 '\\([^)]+,(*SKIP)(*FAIL)|,\\s*', perl = TRUE)) %>%
    unnest(selections) %>%
    dplyr::group_by(selections) %>%
    dplyr::summarise(totalCount = max(totalCount),
              count = n()) %>%
    dplyr::mutate(percent = (count / totalCount) * 100) %>%
    dplyr::arrange(desc(count))
}
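As a quick illustration on a hypothetical answer string, the perl regex above splits on commas only when they fall outside parentheses, so options that themselves contain commas survive the split intact:

```r
# Hypothetical multi-choice answer; the comma inside the parentheses
# belongs to one option and must not trigger a split.
x <- "Python,R,Other (C, C++),SQL"
strsplit(x, '\\([^)]+,(*SKIP)(*FAIL)|,\\s*', perl = TRUE)[[1]]
# → "Python" "R" "Other (C, C++)" "SQL"
```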

# Same single-choice summary as chooseOne, but parameterized over the
# data frame rather than relying on a global.
Academic_exploration = function(question, df){
    df %>%
        filter(UQ(sym(question)) != "") %>%
        dplyr::group_by_(question) %>%
        dplyr::summarise(count = n()) %>%
        dplyr::mutate(percent = (count / sum(count)) * 100) %>%
        dplyr::arrange(desc(count))
}

# Convert a vector of counts to percentages of the total.
proportion_function <- function(vec){
    vec / sum(vec) * 100
}

# Bin a (possibly character-encoded) numeric column into labelled
# intervals, closed on the left.
create_breaks <- function(dfcolumn, breaks, labels){
    dfcolumn <- as.numeric(dfcolumn)
    cut(dfcolumn, breaks = breaks, labels = labels, right = FALSE)
}
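A minimal usage sketch for create_breaks on made-up ages (the helper is repeated here, returning the cut result directly, so the snippet runs standalone):

```r
# Helper as defined above: bin a character-encoded numeric column.
create_breaks <- function(dfcolumn, breaks, labels){
    dfcolumn <- as.numeric(dfcolumn)
    cut(dfcolumn, breaks = breaks, labels = labels, right = FALSE)
}

# Hypothetical ages stored as strings, as in the raw survey export.
ages <- c("21", "35", "67")
create_breaks(ages, breaks = c(0, 30, 60, 100),
              labels = c("under 30", "30-59", "60+"))
# a factor with one label per age: under 30, 30-59, 60+
```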

Profile of a Data Scientist

Data Scientist Demographics

INTRODUCTION HERE


CONCLUSION

Learning Platform Usefulness

Usefulness of Various Learning Platforms

INTRODUCTION

CONCLUSION

Learning Categories

How Data Scientists Learned Their Core Skills

In this section, we examine how data scientists gained their skill sets. We believe that examining how successful data scientists acquired their skills may yield valuable insight into what makes a strong data scientist.

Our data shows great diversity in learning styles. This indicates not only that data scientists learn from a variety of sources, but that the importance of each source varies from one data scientist to the next. This highlights the idea that there is no right or wrong way to learn to become a data scientist. At the same time, since the four major categories account for nearly 100% of education, there are no “secret” learning sources.

It is interesting to note that nearly 75% of data scientists indicate they learned while on the job.

Common Job Algorithms

Common Algorithms and Methods Used by Data Scientists

INTRODUCTION

In this section, we explore commonly used algorithms and methods that are presumably required as basic skills in the data science field.

CONCLUSION

It appears that, on average, data scientists use at least 3 algorithms and 7 methods in their work. As the bar graphs above show, the most commonly used algorithms and methods are as follows:

  • Algorithm
    • Regression/Logistic Regression (15.65%)
    • Decision Trees (12.96%)
    • Random Forests (11.7%)
  • Methods
    • Data Visualization (8%)
    • Logistic Regression (6.83%)
    • Cross-validation (6.74%)
    • Decision Trees (5.93%)
    • Random Forests (5.63%)
    • Neural Networks (5.28%)
    • Time Series Analysis (5.03%)

An average data scientist is expected to use the above-listed algorithms and methods as basic hard skills to meet standard industry expectations. An exceptional data scientist may be capable of handling 7 to 30 methods and 4 to 15 algorithms.
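Per-respondent averages like these can be computed from the id–algorithm pairs built later in this document; a minimal base-R sketch on hypothetical toy data:

```r
# Hypothetical id-algorithm pairs in the same shape as alg.select.dt.
alg <- data.frame(
  id = c(1, 1, 1, 2, 2, 3),
  algorithm = c("Neural Networks", "Random Forests", "RNNs",
                "Decision Trees", "Random Forests",
                "Regression/Logistic Regression")
)

# Count algorithms per respondent, then average across respondents.
per_respondent <- table(alg$id)
mean(per_respondent)
# → 2 for this toy data
```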

Furthermore, the most commonly used dataset size appears to fall in the 1GB–10GB range (> 50%). For reference, the last graph displays the most used methods by size of dataset.

Work Tools Frequency

Frequency of Use for Various Tools by Data Scientists

CONCLUSION

Work Challenges

Challenges Faced by Data Scientists

In this section, we address the challenges faced by Data Scientists, and how their time is typically spent at work.

CONCLUSION

Conclusion

Skill List

Data Scientists…

HARD SKILLS

  1. List

SOFT SKILLS

  1. learn from diverse sources.
  2. continue to learn even after they have secured a job.

In conclusion…

SQL

The original Kaggle data was in an untidy form. As part of data preparation, we each created tidy data subsets and saved them to a series of CSV files stored on our GitHub. The following SQL script will import them into a series of tables. We hope that this will aid future research and help find connections that we may have missed.
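The import script was still being drafted at the time of writing; a sketch of what it could look like for one of the tidy subsets (MySQL syntax; the table name, column types, and file path are hypothetical placeholders, not our actual schema):

```sql
-- Hypothetical table for the tidied learning-category subset
-- (id, category, percent), matching the shape built in Brian's section.
CREATE TABLE learning_category (
    id       INT          NOT NULL,
    category VARCHAR(30)  NOT NULL,
    percent  DECIMAL(5,2) NOT NULL
);

-- Load one of the CSV files exported during data preparation.
-- (Path is a placeholder; substitute the downloaded file's location.)
LOAD DATA LOCAL INFILE 'learning_category.csv'
INTO TABLE learning_category
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(id, category, percent);
```

The same pattern would be repeated once per CSV subset.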


Learning Platform Usefulness: Hovig

Learning Categories: Brian

This subset of data examines how data scientists learned their core skill set. Each data scientist was asked to assign each category a percent from 0 to 100 indicating how much of their education came from that source.

First, a list of categories was extracted and formatted.

tidy.names <- names(raw.data)[61:66] %>% 
  str_extract('(?<=LearningCategory)(\\w+)') %>% 
  str_replace_all('(?<=[a-z])([A-Z])', '_\\1') %>% 
  tolower()
tidy.names %>% kable()
x
self_taught
online_courses
work
university
kaggle
other

The data was tidied and the categories were converted to factors, to aid in analysis.

tidy.data <- raw.data %>%
  select(c(1, 61:66)) %>%
  setNames(c('id', tidy.names)) %>%
  gather('category', 'percent', 2:7, na.rm=TRUE)

tidy.data$percent %<>% as.numeric()

tidy.data$category %<>% factor(levels=tidy.names, ordered=TRUE)
tidy.data %>% head(10) %>% kable()
id category percent
1 self_taught 0
2 self_taught 30
3 self_taught 50
4 self_taught 30
5 self_taught 10
6 self_taught 30
7 self_taught 30
8 self_taught 40
9 self_taught 15
10 self_taught 80

Summary statistics tell an interesting story. No single source averaged more than 50% of a data scientist’s learning, which would seem to indicate that data scientists learn from a diverse set of sources. The ‘other’ category’s mean is nearly 0, indicating that the named categories account for nearly all learning sources.

tidy.summary.data <- tidy.data %>% 
  group_by(category) %>% 
  summarise(avg=mean(percent), sd=sd(percent))
tidy.summary.data %>% kable()
category avg sd
self_taught 35.144046 25.031918
online_courses 20.913201 21.314789
work 19.651230 19.842592
university 17.970174 22.666613
kaggle 5.494971 10.822363
other 1.031872 6.140186

The boxplots support the summary statistics. Each category has numerous upper outliers, indicating that data scientists who learned mostly or entirely from one source were rare.

ggplot(tidy.data) +
  geom_boxplot(aes(category, percent)) +
  xlim(tidy.names %>% rev()) +
  coord_flip() + 
  labs(x='Learning Source', 
       y='Proportion',
       title='Data Scientists Learn From Diverse Sources'
  )

The final plot more clearly shows the diversity in learning styles. This indicates not only that data scientists learn from a variety of sources, but that the importance of each source varies from one data scientist to the next. This highlights the idea that there is no right or wrong way to learn to become a data scientist. At the same time, since the four major categories account for nearly 100% of education, there are no “secret” learning sources.

ggplot(tidy.data) +
  geom_bar(aes(category, fill=percent %>% 
                                round_any(10) %>% 
                                factor(seq(0, 100, 10))
              ), position=position_fill(reverse=TRUE)
          ) +
  scale_color_brewer(palette='Set1') + 
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  labs(x='Learning Source', 
       y='Proportion',
       title='Data Scientists Learn From Diverse Sources',
       fill='Percent'
  )

Common Job Algorithms: Rose

This subset of data examines common algorithms and methods used by data scientists.

First, the proper data is subset.

data.rose <- raw.data %>%
  select(c(1, 80:81, 134:167))

tidy.names <- c(names(data.rose)[1:4], 
                names(data.rose)[5:37] %>% 
                  str_extract('(?<=WorkMethodsFrequency)(.+)')
                )

melt.dt <- data.rose %>%
  setNames(tidy.names) %>%
  gather('WorkMethodsFrequency', 'Frequency', 5:37)

The data on commonly used algorithms was separated into its own table.

alg.select <- melt.dt %>%
  select(c('id', 'WorkAlgorithmsSelect'))
alg.select.list <- alg.select$WorkAlgorithmsSelect %>%
  strsplit(split = ",")
alg.select.dt <- tibble(id = rep(alg.select$id, sapply(alg.select.list, length)), 
                            algorithm = unlist(alg.select.list))
alg.select.dt %>% head(10) %>% kable()
id algorithm
1 Neural Networks
1 Random Forests
1 RNNs
2 Bayesian Techniques
2 Decision Trees
2 Random Forests
2 Regression/Logistic Regression
3 Bayesian Techniques
3 Regression/Logistic Regression
4 CNNs

The data on commonly used methods was separated into its own table.

method.select <- melt.dt %>%
  select(c('id', 'WorkMethodsSelect'))
method.select.list <- method.select$WorkMethodsSelect %>%
  as.character() %>% 
  strsplit(split = ",")
method.select.dt <- tibble(id = rep(method.select$id, sapply(method.select.list, length)), 
                               method = unlist(method.select.list))
method.select.dt %>% head(10) %>% kable()
id method
1 Association Rules
1 Collaborative Filtering
1 Neural Networks
1 PCA and Dimensionality Reduction
1 Random Forests
2 A/B Testing
2 Bayesian Techniques
2 Data Visualization
2 Decision Trees
2 Ensemble Methods

Finally, the data on the frequency of each method was separated into its own table.

freq.dt <- melt.dt %>%
  select(c('id', 'WorkDatasetSize', 'WorkMethodsFrequency', 'Frequency'))
freq.dt %>% head(10) %>% kable()
id WorkDatasetSize WorkMethodsFrequency Frequency
1 10GB A/B NA
2 1GB A/B Sometimes
3 100GB A/B Often
4 1GB A/B NA
5 1GB A/B Rarely
6 100MB A/B NA
7 1TB A/B NA
8 10TB A/B NA
9 NA A/B NA
10 10GB A/B NA
alg.select.dt <- as.data.table(alg.select.dt)
alg.total <- nrow(na.omit(alg.select.dt))
alg.vis <- na.omit(alg.select.dt)[, .(count = length(id)), by = .(algorithm)][order(-count)]
alg.vis$perc <- paste0(round((alg.vis$count / alg.total) * 100, 2), "%")

ggplot(alg.vis, aes(reorder(algorithm, count), count, fill = algorithm)) + 
  geom_text(aes(label = perc), hjust = -0.5, size = 3, color = "black") +
  guides(fill=FALSE) +
  geom_bar(stat = 'identity') +
  coord_flip() + 
  labs(title = "Algorithm used by data scientists",
       x = "Algorithm",
       y = "Proportion") 

method.select.dt <- as.data.table(method.select.dt)
method.total <- nrow(na.omit(method.select.dt))
method.vis <- na.omit(method.select.dt)[, .(count = length(id)), by = .(method)][order(-count)]
method.vis$perc <- paste0(round((method.vis$count / method.total) * 100, 2), "%")

ggplot(method.vis, aes(reorder(method, count), count, fill = method)) + 
  geom_text(aes(label = perc), hjust = -0.5, size = 3, color = "black") +
  guides(fill=FALSE) +
  geom_bar(stat = 'identity') +
  coord_flip() + 
  labs(title = "Method used by data scientists",
       x = "Method",
       y = "Proportion") 

freq.dt <- as.data.table(freq.dt)
freq.dt <- freq.dt[, .(count = .N), by = .(Frequency, WorkMethodsFrequency,WorkDatasetSize)]
method.freq <- freq.dt[, .(count = sum(count)), by = .(WorkMethodsFrequency, Frequency)][order(-count)]
size.freq <- freq.dt[, .(count = sum(count)), by = .(WorkDatasetSize, Frequency)][order(-count)]

ggplot(na.omit(size.freq), aes(reorder(WorkDatasetSize, count), count, fill = WorkDatasetSize)) +
  guides(fill=FALSE) +
  geom_bar(stat = 'identity') +
  theme(axis.text.x = element_text(angle= 90, hjust=1)) +
  coord_flip() +
  facet_wrap(~Frequency) +
  labs(title = "Datasetsize used by frequency",
       x = "Dataset Size",
       y = "Count")

ggplot(na.omit(freq.dt[count>25]), aes(reorder(WorkMethodsFrequency, count), count, fill = WorkMethodsFrequency)) +
  guides(fill=FALSE) +
  geom_bar(stat = 'identity') +
  theme(axis.text.x = element_text(angle= 90, hjust=1)) +
  coord_flip() +
  facet_wrap(~WorkDatasetSize) +
  labs(title = "Method Used Per Data Size",
       x = "Work Methods",
       y = "Count")

Work Tools Frequency: Meaghan

The data was very untidy and expanded into two columns past the “WorkToolsSelect” column. I brought in the 3 columns and replaced elements to ensure an easy split.

tidy.names <- names(raw.data)[83:132]%>% 
  str_extract('(?<=WorkToolsFrequency)(\\w+)') %>% 
  str_replace_all('(?<=[a-z])([A-Z])', '_\\1') 

tools.data <- raw.data %>%
  select(c(1, 82:84)) %>%
  setNames(c('id', 'tool_used', "temp_1", "temp_2"))%>%
  unite_("tool_used", c("tool_used","temp_1","temp_2"))%>%
  mutate(tool_used = (str_replace_all(tool_used, '/', ',')),
         tool_used = (str_replace_all(tool_used, '_', ',')))%>%
  mutate(tool_counter =1)
tools.data <- cSplit(tools.data, 'tool_used', ',')

I used the gather function to reformat the table.

id.tool.df <- tools.data %>%
  gather(tool_group, tool, names(tools.data)[3:63])%>%
  group_by(id, tool)%>%
  summarise(sum_tool = sum(tool_counter))%>%
  drop_na()%>%
  filter(!tool %in% c("Rarely", "Often",
                      "Sometimes", "Most of the time"))
id.tool.df %>% head(10) %>% kable()
id tool sum_tool
1 Amazon Web services 1
1 Oracle Data Mining 1
1 Oracle R Enterprise 1
1 Perl 1
2 Amazon Machine Learning 1
2 Amazon Web services 1
2 Cloudera 1
2 Hadoop 1
2 Hive 1
2 Impala 1

I created a summary table representing the frequency of each response.

summary.tool.df <- tools.data %>%
  gather(tool_group, tool, names(tools.data)[3:63])%>%
  group_by(tool)%>%
  summarise(sum_tool = sum(tool_counter))%>%
  drop_na()%>%
  arrange(desc(sum_tool))%>%
  filter(!tool %in% c("Rarely", "Often",
                      "Sometimes", "Most of the time"))%>%
  mutate(percent_total = round((sum_tool/ sum(sum_tool))*100,digits = 2))
summary.tool.df %>% head(10) %>% kable()
tool sum_tool percent_total
Python 1800 10.82
R 1424 8.56
SQL 1130 6.79
Jupyter notebooks 945 5.68
TensorFlow 758 4.56
C 610 3.67
C++ 610 3.67
MATLAB 561 3.37
Octave 561 3.37
awk 545 3.28

This plot shows the top data science skills given the filters used.

ggplot(head(summary.tool.df,15), aes(x=reorder(tool, -sum_tool), y=percent_total)) + 
  geom_bar(stat="identity", width=.5, fill="tomato3") +
  geom_text(aes(label=percent_total))+
  labs(x='Tool', 
       y='Percent Total',
       title="Top 15 Data Science Tools", 
       caption="Source: Multiple Choice Responses") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

A data frame of tool frequency by id was created.

frequency.data <- raw.data %>%
  select(c(1, 83:132)) %>%
  setNames(c('id', tidy.names))

id.frquency.table <-frequency.data %>% 
  gather(tool_name, frequency_id, names(frequency.data)[2:51])%>%
  filter(frequency_id %in% c("Rarely", "Often",
                             "Sometimes", "Most of the time"))%>%
  arrange(id)
id.frquency.table %>% head(10) %>% kable()
id tool_name frequency_id
1 AWS Rarely
1 Oracle Sometimes
1 Perl Most of the time
2 Amazon_ML Rarely
2 AWS Often
2 Cloudera Rarely
2 Hadoop Rarely
2 Impala Rarely
2 Java Rarely
2 Mathematica Rarely

I grouped the frequency information by tool name and response.

summary.frquency.table <- frequency.data %>% 
  gather(tool_name, frequency_id, names(frequency.data)[2:51])%>%
  filter(frequency_id %in% c("Rarely", "Often",
                             "Sometimes", "Most of the time"))%>%
  mutate(freq_counter =1) %>%
  group_by(tool_name,frequency_id)%>%
  summarise(sum_feq = sum(freq_counter))%>%
  arrange(desc(sum_feq))
summary.frquency.table %>% head(10) %>% kable()
tool_name frequency_id sum_feq
Python Most of the time 1033
R Most of the time 646
SQL Most of the time 521
Python Often 389
Jupyter Most of the time 371
R Often 340
SQL Often 302
R Sometimes 299
Python Sometimes 277
Jupyter Often 274

The plot shows frequency of use of top technologies.

ordering <- c('Most of the time', 'Often', 'Sometimes', 'Rarely')

ggplot(head(summary.frquency.table,50), aes(x = frequency_id, y = sum_feq, fill = tool_name)) + 
  geom_bar(stat = "identity") + 
  facet_wrap(~tool_name) + 
  ylab("Number of times a response was selected") + 
  xlim(ordering) +
  theme(legend.position="none") +
  theme(axis.text.x = element_text(angle = 90, 
                                   vjust = 0.5, 
                                   hjust = 1))

Work Challenges: Albert

What Challenges Do Data Scientists Experience?

In this section, we address the challenges faced by Data Scientists, and how their time is typically spent at work. We believe that the time spent performing data science related tasks and their respective challenges will provide useful insights on the skills necessary to succeed as a data scientist.

challenges <- raw.data %>% 
  select(id, "WorkChallengesSelect") %>% 
  cSplit("WorkChallengesSelect", sep = ",", direction = "long")
challenges.names <- raw.data %>%
  select(starts_with("WorkChallenge"), -WorkChallengesSelect) %>%
  names() %>%
  str_extract('(?<=WorkChallengeFrequency)(\\w+)') %>% 
  str_replace_all('(?<=[a-z])([A-Z])', '_\\1') %>%
  tolower()

challenges.frequency <- raw.data %>%  
  select(id, starts_with("WorkChallenge"), -WorkChallengesSelect) %>%
  setNames(c('id', challenges.names)) %>%
  gather("WorkChallengeFrequency", "Frequency", -id )
time.names <- raw.data %>%
  select(starts_with("Time"), - TimeSpentStudying) %>%
  names() %>%
  str_extract('(?<=Time)(\\w+)') %>% 
  str_replace_all('(?<=[a-z])([A-Z])', '_\\1') %>%
  tolower()

time.spent <- raw.data %>%  
  select(id, starts_with("Time"), - TimeSpentStudying) %>%
  setNames(c('id', time.names)) %>%
  gather("Activity", "Time", -id )
filter( time.spent, !is.na(`Time`), `Time` != 'NA') %>% 
  mutate(`Time` = as.integer(`Time`)) %>% 
  group_by(`Activity`) %>% 
  summarise(`Time`= round(mean(`Time`), digits = 2)) %>% 
    ggplot(aes( x = reorder(Activity, Time), y = Time,fill=Activity, label = `Time`) ) + 
      geom_bar(stat = "identity", show.legend = F) + 
      geom_text(size = 2, position = position_stack(vjust = 0.5)) + 
      coord_flip() +  
      labs(title = "Distribution of Time Spent", x = "Activity")

  filter(challenges, !is.na(`WorkChallengesSelect`)) %>% 
  group_by(`WorkChallengesSelect` ) %>%
  summarise( Count = n()/nrow(filter(raw.data,!is.na(`WorkChallengesSelect`)) ) * 100 ) %>%
  ggplot(aes(x = reorder(WorkChallengesSelect, Count), y = Count,  fill=WorkChallengesSelect, label = round(Count))) + 
  geom_bar( stat = 'identity', show.legend = F ) + 
  geom_text(size = 2, position = position_stack(vjust = 0.5)) +
  coord_flip() +  
  labs( title = "Most Time is Spent Organizing Data", x = "Challenges", y = "%" )

freq.data <- filter( challenges.frequency, !is.na(`Frequency`) ) %>%
  mutate(`WorkChallengeFrequency` = str_replace(`WorkChallengeFrequency`, "WorkChallengeFrequency", ""))  %>% 
  group_by(`WorkChallengeFrequency`, `Frequency`) %>% 
  summarise(`Count`= n()) %>% 
  mutate(`Ratio` = round(  ( `Count` / sum( `Count` ) ) * 100, digits = 2) )
freq.data$Frequency <- factor( freq.data$Frequency, levels = c("Most of the time", "Often" , "Sometimes", "Rarely") )
  
  ggplot(freq.data, aes(x = WorkChallengeFrequency, y = Ratio, fill = Frequency,label = Ratio ) ) + 
  geom_bar( stat = "identity" ) + 
  geom_text(size = 2, position = position_stack(vjust = 0.5)) + 
  coord_flip() +  
  scale_fill_brewer(palette = 'RdYlBu') + 
  labs( title = "Majority of the Challenges are Faced Often to  Most of the Time", x = "Challenge", y = "Frequency" )

The data shows that data scientists spend a whopping 34% of their time gathering and cleaning data. Almost 25% of their time is spent selecting/building models, and 27% is spent visualizing, discovering, and communicating insights to stakeholders. This is evidence that data scientists must have superb data cleaning and modeling skills, and must be able to visually and verbally communicate their findings to stakeholders.

Interestingly, dirty data is the most prevalent challenge, at 48%. A staggering 39% of data scientists were challenged by issues related to company politics and financial/management support. Interpersonal skills are vital in navigating office politics, and technical writing skills may aid in drafting proposals for financial support. 31% of respondents reported challenges with data access and availability; advanced data acquisition skills are therefore essential for data scientists.

24% of respondents reported that results from data science projects go unused. This is alarming, given that data science can be very expensive. Honing communication skills may reduce the proportion of unused results.

One in four data scientists lack a clear question to answer or a direction to take with the data, one in five reported challenges in explaining data science to others, and one in seven reported issues with maintaining reasonable expectations for data science projects. These all speak to communication skills and the ingenuity/creativity to frame questions and problems in a way that will garner proper responses from stakeholders.

It would have been interesting to investigate the relationship between the rate of unused results and the variables related to communication. However, the responses were internally randomized, so results in the same row may not be from the same respondent.

The necessary skills based on time distribution and challenges are:

Hard Skills:

  • Data cleaning/management
  • Data Modeling (can be substituted with the results from Rose’s section )
  • Data visualization
  • Data Exploration

Soft Skills

  • Communication (verbal & written)
  • Interpersonal
  • Creativity/Ingenuity

Conclusion (writing to csv/importing to sql?)
